Predictive Analytics Based on the NHANES 1999-2016 Dataset for the Hepatitis A Antibody Prediction: A Python Case Study

AUTHORS

Mai Thi Hoang Ta, Department of Computer Science, Lakehead University, Ontario, Canada
Jinan Fiaidhi, Department of Computer Science, Lakehead University, Ontario, Canada
Sabah Mohammed, Department of Computer Science, Lakehead University, Ontario, Canada

ABSTRACT

Predictive analytics aims to build an analytical model that predicts a target variable. This area of data science has applications in many fields, including analytical customer relationship management, direct marketing, project risk management, and clinical decision support systems. Our research applies predictive analytics to a healthcare dataset in search of models that can predict health-related target variables from related input factors such as demographics, dietary habits, and examination measurements such as weight and height. The data come from the National Health and Nutrition Examination Survey (NHANES), a program conducted by the U.S. Centers for Disease Control and Prevention. The dataset consists of 93,702 observations across 961 categories and contains both interview and examination data from more than 93,000 participants. We employed Multiple Linear Regression, Logistic Regression, Support Vector Classification, Support Vector Regression, Random Forest Classification (RFC), and Random Forest Regression to build prediction models on the cleaned dataset. The results show that we obtained good models for predicting related social and healthcare factors (AUC ranging from 0.76 to 0.87), that RFC outperformed the other classification algorithms, that Fisher Score was a key feature selection algorithm, and that demographic factors played a dominant role in predicting some questionnaire and laboratory target variables. Finally, based on the best prediction models, we developed a web-based Hepatitis A Antibody prediction system.
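The abstract names two ideas that are easy to illustrate in a few lines: ranking features by Fisher Score and judging a model by AUC. The following is a minimal sketch using only the Python standard library. It is not the authors' code; the function names, the feature names, and the toy values are hypothetical and are not taken from NHANES.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature: between-class scatter over within-class scatter."""
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (fmean(group) - overall_mean) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")


def auc(scores, labels):
    """AUC as the chance that a positive case outranks a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data (assumption): 1 = antibody positive, 0 = antibody negative.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "age": [5, 12, 20, 31, 28, 45, 60, 72],
    "noise": [3, 9, 1, 7, 2, 8, 4, 6],
}

# Rank features from most to least discriminative.
ranking = sorted(
    features,
    key=lambda name: fisher_score(features[name], labels),
    reverse=True,
)
print(ranking)
print(auc(features["age"], labels))
```

In this toy setup the hypothetical "age" feature separates the two classes far better than "noise", so it ranks first and, used directly as a score, gives a much higher AUC. In practice the selected features would feed a classifier such as a random forest, and AUC would be computed on held-out predictions.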

 

KEYWORDS

NHANES, National Health and Nutrition Examination Survey, Random Forest, Support Vector Machine, Logistic Regression, Python, Machine Learning

CITATION

  • APA:
    Ta, M. T. H., Fiaidhi, J., & Mohammed, S. (2018). Predictive Analytics Based on the NHANES 1999-2016 Dataset for the Hepatitis A Antibody Prediction: A Python Case Study. International Journal of Bio-Science and Bio-Technology, 10(2), 13-26. doi:10.21742/IJBSBT.2018.10.2.03
  • Harvard:
    Ta, M.T.H., Fiaidhi, J., Mohammed, S. (2018). "Predictive Analytics Based on the NHANES 1999-2016 Dataset for the Hepatitis A Antibody Prediction: A Python Case Study". International Journal of Bio-Science and Bio-Technology, 10(2), pp. 13-26. doi:10.21742/IJBSBT.2018.10.2.03
  • IEEE:
    [1] M. T. H. Ta, J. Fiaidhi, S. Mohammed, "Predictive Analytics Based on the NHANES 1999-2016 Dataset for the Hepatitis A Antibody Prediction: A Python Case Study". International Journal of Bio-Science and Bio-Technology, vol. 10, no. 2, pp. 13-26, Jun. 2018
  • MLA:
    Ta, Mai Thi Hoang, Jinan Fiaidhi, and Sabah Mohammed. "Predictive Analytics Based on the NHANES 1999-2016 Dataset for the Hepatitis A Antibody Prediction: A Python Case Study". International Journal of Bio-Science and Bio-Technology, vol. 10, no. 2, Jun. 2018, pp. 13-26, doi:10.21742/IJBSBT.2018.10.2.03

ISSUE INFO

  • Volume 10, No. 2, 2018
  • ISSN(p): 2233-7849
  • ISSN(e): 2208-9810
  • Published: Jun. 2018
